AIML Module Project - UNSUPERVISED LEARNING - Project 1

DOMAIN :- Automobile


Import and warehouse data

Task: Import all the given datasets and explore shape and size of each.
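As a sketch, a small helper (hypothetical name `load_and_describe`) that loads a CSV and reports its shape; the actual file paths depend on the datasets provided:

```python
import io
import pandas as pd

def load_and_describe(src):
    """Load a CSV (path or file-like object) and report its shape."""
    df = pd.read_csv(src)
    print(f"shape: {df.shape[0]} rows x {df.shape[1]} columns")
    return df
```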

Data Cleaning

Task: Missing/incorrect value treatment

Out of the 398 rows, 6 have '?' in the hp column. We could drop those 6 rows, but that is not a good idea in every situation. Here we will replace them with the median value: first replace '?' with NaN, then replace NaN with the median.
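A minimal sketch of this imputation, assuming the dataframe has an `hp` column stored as strings (as in the mpg dataset):

```python
import numpy as np
import pandas as pd

def impute_hp(df):
    """Replace '?' entries in the hp column with NaN, then fill with the median."""
    df = df.copy()
    df["hp"] = pd.to_numeric(df["hp"].replace("?", np.nan), errors="coerce")
    df["hp"] = df["hp"].fillna(df["hp"].median())
    return df
```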

Task: Dropping the unwanted attributes

Data analysis & visualisation:

hp and acc have some outliers in the dataset.

Also, disp is highly positively correlated with hp and wt.

From this we can see that multicollinearity is very likely present in this dataset.

The VIF values confirm the high multicollinearity, so we can reduce the dimensionality further.
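The VIF computation can be sketched directly from the inverse of the correlation matrix (equivalent to the usual 1/(1−R²) definition); the column names below are illustrative:

```python
import numpy as np
import pandas as pd

def vif_table(X: pd.DataFrame) -> pd.Series:
    """VIF_i is the i-th diagonal element of the inverse correlation matrix;
    values above roughly 5-10 flag strong multicollinearity."""
    corr = np.corrcoef(X.values, rowvar=False)
    return pd.Series(np.diag(np.linalg.inv(corr)), index=X.columns, name="VIF")
```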

Applying PCA

Dimensionality Reduction

Three dimensions now seem very reasonable: with 3 components we can explain over 95% of the variation in the original data!

The pair plot above shows the reduced data. With 3 components we retain most of the information in the original dataset.
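The choice of 3 components can be reproduced with a small helper that scales the data and reads off the cumulative explained-variance ratio; the 0.95 threshold matches the 95% figure above:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def n_components_for(X, threshold=0.95):
    """Smallest number of PCs whose cumulative explained variance reaches threshold."""
    Xs = StandardScaler().fit_transform(X)
    cum = np.cumsum(PCA().fit(Xs).explained_variance_ratio_)
    return int(np.searchsorted(cum, threshold) + 1)
```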

Machine Learning

K - means Clustering

In the elbow graph we see a bend at k=3; k=5 also shows cluster formation, with more independent clusters.
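The elbow graph can be generated from the K-means inertia at each k, along these lines (using scikit-learn's `KMeans`):

```python
from sklearn.cluster import KMeans

def inertia_curve(X, k_max=8):
    """Within-cluster sum of squares for k = 1..k_max; look for the 'elbow'."""
    return [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, k_max + 1)]
```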

Analyse with box plots by grouping

The graph shows that groups 1, 2 and 3 do not overlap, except in the yr column.

Hierarchical Clustering

Comparing K-means clustering and hierarchical clustering:

Hierarchical clustering does not scale well to a large number of data points.

Looking at the group means and label means for K=3, the two methods give very similar results.
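A minimal sketch of the hierarchical side of this comparison, using SciPy's Ward linkage and cutting the dendrogram into k flat clusters:

```python
from scipy.cluster.hierarchy import fcluster, linkage

def hier_labels(X, k=3):
    """Ward-linkage agglomerative clustering, cut into k flat clusters."""
    Z = linkage(X, method="ward")
    return fcluster(Z, t=k, criterion="maxclust")
```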

Mention how many optimal clusters are present in the data and what could be the possible reason behind it.

The data has 3 optimal clusters.

Looking at the pair plot, we can see up to 3 peaks in the distributions, which means the data is concentrated in those 3 regions.

Use linear regression model on different clusters separately and print the coefficients of the models individually

Let's print the coefficients for the 3 different clusters.
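A sketch of fitting one linear regression per cluster and collecting the coefficients; `X`, `y` and the cluster `labels` are assumed to come from the earlier steps:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def per_cluster_coefs(X, y, labels):
    """Fit a separate linear regression on each cluster and collect its coefficients."""
    coefs = {}
    for c in np.unique(labels):
        mask = labels == c
        coefs[c] = LinearRegression().fit(X[mask], y[mask]).coef_
    return coefs
```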

How using different models for different clusters will be helpful in this case and how it will be different than using one single model without clustering? Mention how it impacts performance and prediction.

Fitting a different model for each cluster, we can see differences in the accuracy scores. In this case it helps to improve the model's prediction and accuracy within each cluster.

Fitting a single model on the original data without clustering, we see a higher accuracy score than for the clustered models.


AIML Module Project - UNSUPERVISED LEARNING - Project 2

DOMAIN :- Manufacturing


As we know, the dataset has only 2 target classes, i.e. Quality A and Quality B. In the graph we also see the elbow point at 2.

Comparing visually, we see that 0 refers to Quality A and 1 refers to Quality B.

From the sample of 25 data points, the cluster groups follow the same pattern as the original data points, i.e. Quality A = 1 and Quality B = 0.
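One way to check this mapping is a simple cross-tabulation of cluster ids against the true quality labels (a sketch; the label arrays are assumed from the earlier steps):

```python
import pandas as pd

def match_clusters_to_labels(cluster_labels, true_labels):
    """Cross-tabulate cluster ids against true classes to see which cluster maps to which quality."""
    return pd.crosstab(pd.Series(cluster_labels, name="cluster"),
                       pd.Series(true_labels, name="quality"))
```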


AIML Module Project - UNSUPERVISED LEARNING - Project 3

DOMAIN :- Automobile


Data import, Clean & Pre-processing

EDA

Cars are double in number compared to buses and vans.

Implies --> 0 = bus, 1 = car, 2 = van

  1. Many features show high correlation, indicating that we need to drop multiple features; we will use PCA for this.
  2. The spread of compactness is least for vans; mean compactness is highest for cars. For buses, compactness is right-skewed, indicating that fewer buses have high compactness.
  3. The distribution of max.length_rectangularity is almost the same for cars, buses and vans.
  4. Mean scaled variance is highest for cars, followed by buses, then vans.

From the correlation matrix above we can see that many features are highly correlated. Looking carefully, scaled_variance.1 and scatter_ratio have a correlation of 1, and many other pairs have a correlation above 0.9 (positive or negative), e.g. skewness_about.2 and hollows_ratio, scaled_variance and scaled_variance.1, elongatedness and scaled_variance, elongatedness and scaled_variance.1, etc.

There are many dimensions with correlation above ±0.7, and it is difficult to determine manually which ones to drop. We will use PCA for this.
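Although we use PCA here, a manual high-correlation filter would look something like this sketch (dropping one column from each pair whose absolute correlation exceeds a chosen threshold):

```python
import numpy as np
import pandas as pd

def drop_high_corr(df, threshold=0.9):
    """Drop one column from every pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # keep only the upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)
```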

The compactness distribution is right-skewed and has no outliers.

No outliers found in circularity.

No outliers found in distance_circularity.

Outliers found in radius_ratio; we will handle them in a later section.

Outliers found in pr.axis_aspect_ratio; we will handle them in a later section.

Outliers found in max.length_aspect_ratio; we will handle them in a later section.

No outliers found in scatter_ratio.

Outliers found in scaled_radius_of_gyration.1; we will handle them in a later section.

Outliers found in skewness_about; we will handle them in a later section.

Outliers found in skewness_about.1; we will handle them in a later section.

  1. From all the univariate analysis above, we can see that some features have outliers.
  2. Outliers are present in skewness_about.1, skewness_about, scaled_variance.1, max.length_aspect_ratio, pr.axis_aspect_ratio and radius_ratio.

Handling Outliers

Handling Outliers in radius_ratio

All the outliers are found in class 2, i.e. Van. Now let's look at the highest in-range values of class 2.
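One common way to handle such outliers is to clip them to the 1.5×IQR whiskers; this is a sketch of that approach, not necessarily the exact treatment applied here:

```python
import pandas as pd

def cap_outliers_iqr(s: pd.Series) -> pd.Series:
    """Clip values outside the 1.5*IQR whiskers to the nearest whisker."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
```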

Handling Outliers in pr.axis_aspect_ratio

Indexes 4 and 100 belong to class 0, i.e. Bus; the rest are all Van.

Handling Outliers in scaled_variance.1

Handling Outliers in scaled_radius_of_gyration.1

Both classes have values within the nominal range, so there is no need to change them.

EDA

CLASSIFIERS

Model training, testing and tuning:

Comparing the different kernels, the linear kernel achieves the highest accuracy of all the kernels tried.
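The kernel comparison can be sketched with cross-validated accuracy per kernel (scaling first, since SVMs are scale-sensitive):

```python
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def kernel_scores(X, y, kernels=("linear", "rbf", "poly")):
    """Mean 5-fold cross-validated accuracy for each SVM kernel."""
    return {k: cross_val_score(
                make_pipeline(StandardScaler(), SVC(kernel=k)), X, y, cv=5).mean()
            for k in kernels}
```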

APPLYING PCA ON DATASET

Now the first 8 PCs capture a high percentage of the variance in the data (98%).

CLASSIFIER AFTER PCA

Looking at the confusion matrix, the model predicted:

    van: 47/53 (89%)
    car: 116/121 (96%)
    bus: 64/67 (96%)

    Overall accuracy: 94%

CONCLUSION

With all attributes we got 96% accuracy on the test data.

After reducing the dimensions from 18 to 8, the overall accuracy of the model is 87% on training data and 86% on test data.

> Here PCA helped reduce the 18 attributes to 8 without much loss of information from the original data.

> It also reduced the computation time.


AIML Module Project - UNSUPERVISED LEARNING - Project 4

DOMAIN :- Sports management


Let's look at each feature's distribution.

Runs, Avg, 4's, 6's and HF are positively correlated.

  1. As seen in the correlation matrix, Runs has a high positive correlation with 4's, 6's and HF.

Detailed EDA

Implementing PCA

The most important feature is Runs, which accounts for about 70% of the explained variance.

Here we separate all the players into Grade A and Grade B for the sports management company to make business decisions.

Rankings within Grade A and Grade B are also given to help the company make decisions about players.
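A hypothetical sketch of one such grading scheme: score players on the first principal component of their scaled stats, split at the median into Grade A and Grade B, then rank by score (the actual criteria used may differ):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def grade_and_rank(X):
    """Score players by PC1, split at the median into grades A/B, rank by score."""
    Xs = StandardScaler().fit_transform(X)
    score = PCA(n_components=1).fit_transform(Xs).ravel()
    # PCA signs are arbitrary: flip so higher raw totals map to higher scores
    if np.corrcoef(score, Xs.sum(axis=1))[0, 1] < 0:
        score = -score
    grade = np.where(score >= np.median(score), "A", "B")
    rank = (-score).argsort().argsort() + 1  # 1 = best
    return grade, rank
```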


AIML Module Project - UNSUPERVISED LEARNING - Project 5


List down all possible dimensionality reduction techniques that can be implemented using python


Missing Value Ratio: If the dataset has too many missing values, we use this approach to reduce the number of variables. We can drop the variables having a large number of missing values in them

Low Variance filter: We apply this approach to identify and drop constant variables from the dataset. The target variable is not unduly affected by variables with low variance, and hence these variables can be safely dropped
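As a sketch, scikit-learn's `VarianceThreshold` implements this filter directly (constant columns are removed at the default threshold):

```python
from sklearn.feature_selection import VarianceThreshold

def drop_low_variance(X, threshold=0.0):
    """Remove features whose variance is at or below threshold (constant columns by default)."""
    sel = VarianceThreshold(threshold=threshold)
    return sel.fit_transform(X), sel.get_support()
```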

High Correlation filter: A pair of variables having high correlation increases multicollinearity in the dataset. So, we can use this technique to find highly correlated features and drop them accordingly

Random Forest: This is one of the most commonly used techniques; it tells us the importance of each feature present in the dataset. We can find the importance of each feature and keep the top-most features, resulting in dimensionality reduction

Backward Feature Elimination and Forward Feature Selection: Both techniques take a lot of computational time and are thus generally used on smaller datasets

Factor Analysis: This technique is best suited for situations where we have highly correlated set of variables. It divides the variables based on their correlation into different groups, and represents each group with a factor

Principal Component Analysis: This is one of the most widely used techniques for dealing with linear data. It divides the data into a set of components which try to explain as much variance as possible

Independent Component Analysis: We can use ICA to transform the data into independent components which describe the data using a smaller number of components

ISOMAP: We use this technique when the data is strongly non-linear

t-SNE: This technique also works well when the data is strongly non-linear. It works extremely well for visualizations as well

UMAP: This technique works well for high dimensional data. Its run-time is shorter as compared to t-SNE

So far you have used dimensionality reduction on numeric data. Is it possible to do the same on multimedia data (images and video) and text data? Please illustrate your findings using a simple implementation in Python.

So we've gone from 4096 dimensions to just 292! But how good is this really?

Let's train PCA on our training set and transform the data, then print out an example

You can see it's far from perfect, but it's still clear what shape the hand is making

And as you can see we've taken this simple model from ~30% accuracy on the test set to ~63%

We can use PCA on image data as well, and we achieved an increase in accuracy with PCA compared to without it.
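A self-contained illustration of the same idea on scikit-learn's bundled 8x8 digit images (not the hand-gesture data above): project the 64-dimensional pixel vectors onto a few PCs and reconstruct:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

def compress_digits(n_components=16):
    """Project 8x8 digit images (64 dims) onto n_components PCs and reconstruct.

    Returns the total explained variance ratio and the reconstructed shape."""
    X = load_digits().data
    pca = PCA(n_components=n_components).fit(X)
    X_rec = pca.inverse_transform(pca.transform(X))
    return float(pca.explained_variance_ratio_.sum()), X_rec.shape
```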

Example 2

With PCA dimensionality reduction we were able to reduce 158 dimensions to 13, and we can see the original vs. PCA-reconstructed image at the end.